Recent years have seen progress beyond domain-specific sound separation for speech or music towards universal sound separation for arbitrary sounds. Prior work on universal sound separation has investigated separating a target sound out of an audio mixture given a text query. Such text-queried sound separation systems provide a natural and scalable interface for specifying arbitrary target sounds. However, supervised text-queried sound separation systems require costly labeled audio-text pairs for training. Moreover, the audio provided in existing datasets is often recorded in a controlled environment, causing a considerable generalization gap to noisy audio in the wild. In this work, we aim to approach text-queried universal sound separation by using only unlabeled data. We propose to leverage the visual modality as a bridge to learn the desired audio-textual correspondence. The proposed CLIPSep model first encodes the input query into a query vector using the contrastive language-image pretraining (CLIP) model, and the query vector is then used to condition an audio separation model to separate out the target sound. While the model is trained on image-audio pairs extracted from unlabeled videos, at test time we can instead query the model with text inputs in a zero-shot setting, thanks to the joint language-image embedding learned by the CLIP model. Further, videos in the wild often contain off-screen sounds and background noise that may hinder the model from learning the desired audio-textual correspondence. To address this problem, we further propose an approach called noise invariant training for training a query-based sound separation model on noisy data. Experimental results show that the proposed models successfully learn text-queried universal sound separation using only noisy unlabeled videos, even achieving competitive performance against a supervised model in some settings.
translated by 谷歌翻译
Removing reverb from reverberant music is a necessary technique to clean up audio for downstream music manipulations. Reverberation of music contains two categories, natural reverb, and artificial reverb. Artificial reverb has a wider diversity than natural reverb due to its various parameter setups and reverberation types. However, recent supervised dereverberation methods may fail because they rely on sufficiently diverse and numerous pairs of reverberant observations and retrieved data for training in order to be generalizable to unseen observations during inference. To resolve these problems, we propose an unsupervised method that can remove a general kind of artificial reverb for music without requiring pairs of data for training. The proposed method is based on diffusion models, where it initializes the unknown reverberation operator with a conventional signal processing technique and simultaneously refines the estimate with the help of diffusion models. We show through objective and perceptual evaluations that our method outperforms the current leading vocal dereverberation benchmarks.
translated by 谷歌翻译
We propose an end-to-end music mixing style transfer system that converts the mixing style of an input multitrack to that of a reference song. This is achieved with an encoder pre-trained with a contrastive objective to extract only audio effects related information from a reference music recording. All our models are trained in a self-supervised manner from an already-processed wet multitrack dataset with an effective data preprocessing method that alleviates the data scarcity of obtaining unprocessed dry data. We analyze the proposed encoder for the disentanglement capability of audio effects and also validate its performance for mixing style transfer through both objective and subjective evaluations. From the results, we show the proposed system not only converts the mixing style of multitrack audio close to a reference but is also robust with mixture-wise style transfer upon using a music source separation model.
translated by 谷歌翻译
Score-based generative models learn a family of noise-conditional score functions corresponding to the data density perturbed with increasingly large amounts of noise. These perturbed data densities are tied together by the Fokker-Planck equation (FPE), a PDE governing the spatial-temporal evolution of a density undergoing a diffusion process. In this work, we derive a corresponding equation characterizing the noise-conditional scores of the perturbed data densities (i.e., their gradients), termed the score FPE. Surprisingly, despite impressive empirical performance, we observe that scores learned via denoising score matching (DSM) do not satisfy the underlying score FPE. We mathematically analyze three implications of satisfying the score FPE and a potential explanation for why the score FPE is not satisfied in practice. At last, we propose to regularize the DSM objective to enforce satisfaction of the score FPE, and show its effectiveness on synthetic data and MNIST.
translated by 谷歌翻译
传统上,音乐混合涉及以干净,单个曲目的形式录制乐器,并使用音频效果和专家知识(例如,混合工程师)将它们融合到最终混合物中。近年来,音乐制作任务的自动化已成为一个新兴领域,基于规则的方法和机器学习方法已被探索。然而,缺乏干燥或干净的仪器记录限制了这种模型的性能,这与专业的人造混合物相去甚远。我们探索是否可以使用室外数据,例如潮湿或加工的多轨音乐录音,并将其重新利用以训练有监督的深度学习模型,以弥合自动混合质量的当前差距。为了实现这一目标,我们提出了一种新型的数据预处理方法,该方法允许模型执行自动音乐混合。我们还重新设计了一种用于评估音乐混合系统的听力测试方法。我们使用经验丰富的混合工程师作为参与者来验证结果。
translated by 谷歌翻译
一个著名的矢量定量变分自动编码器(VQ-VAE)的问题是,学识渊博的离散表示形式仅使用代码书的全部容量的一小部分,也称为代码书崩溃。我们假设VQ-VAE的培训计划涉及一些精心设计的启发式方法,这是这个问题的基础。在本文中,我们提出了一种新的训练方案,该方案通过新颖的随机去量化和量化扩展标准VAE,称为随机量化变异自动编码器(SQ-VAE)。在SQ-VAE中,我们观察到一种趋势,即在训练的初始阶段进行量化是随机的,但逐渐收敛于确定性量化,我们称之为自宣传。我们的实验表明,SQ-VAE在不使用常见启发式方法的情况下改善了代码书的利用率。此外,我们从经验上表明,在视觉和语音相关的任务中,SQ-VAE优于VAE和VQ-VAE。
translated by 谷歌翻译
鉴于音乐源分离和自动混合的最新进展,在音乐曲目中删除音频效果是开发自动混合系统的有意义的一步。本文着重于消除对音乐制作中吉他曲目应用的失真音频效果。我们探索是否可以通过设计用于源分离和音频效应建模的神经网络来解决效果的去除。我们的方法证明对混合处理和清洁信号的效果特别有效。与基于稀疏优化的最新解决方案相比,这些模型获得了更好的质量和更快的推断。我们证明这些模型不仅适合倾斜,而且适用于其他类型的失真效应。通过讨论结果,我们强调了多个评估指标的有用性,以评估重建的不同方面的变形效果去除。
translated by 谷歌翻译
变异自动编码器(VAE)经常遭受后塌陷,这是一种现象,其中学习过的潜在空间变得无知。这通常与类似于数据差异的高参数有关。此外,如果数据方差不均匀或条件性,则确定这种适当的选择将变得不可行。因此,我们提出了具有数据方差的广义参数化的VAE扩展,并将最大似然估计纳入目标函数中,以适应解码器平滑度。由提议的VAE扩展产生的图像显示,MNIST和Celeba数据集上的Fr \'Echet Inception距离(FID)得到了改善。
translated by 谷歌翻译
Many e-commerce marketplaces offer their users fast delivery options for free to meet the increasing needs of users, imposing an excessive burden on city logistics. Therefore, understanding e-commerce users' preference for delivery options is a key to designing logistics policies. To this end, this study designs a stated choice survey in which respondents are faced with choice tasks among different delivery options and time slots, which was completed by 4,062 users from the three major metropolitan areas in Japan. To analyze the data, mixed logit models capturing taste heterogeneity as well as flexible substitution patterns have been estimated. The model estimation results indicate that delivery attributes including fee, time, and time slot size are significant determinants of the delivery option choices. Associations between users' preferences and socio-demographic characteristics, such as age, gender, teleworking frequency and the presence of a delivery box, were also suggested. Moreover, we analyzed two willingness-to-pay measures for delivery, namely, the value of delivery time savings (VODT) and the value of time slot shortening (VOTS), and applied a non-semiparametric approach to estimate their distributions in a data-oriented manner. Although VODT has a large heterogeneity among respondents, the estimated median VODT is 25.6 JPY/day, implying that more than half of the respondents would wait an additional day if the delivery fee were increased by only 26 JPY, that is, they do not necessarily need a fast delivery option but often request it when cheap or almost free. Moreover, VOTS was found to be low, distributed with the median of 5.0 JPY/hour; that is, users do not highly value the reduction in time slot size in monetary terms. These findings on e-commerce users' preferences can help in designing levels of service for last-mile delivery to significantly improve its efficiency.
translated by 谷歌翻译
In recent years, various service robots have been introduced in stores as recommendation systems. Previous studies attempted to increase the influence of these robots by improving their social acceptance and trust. However, when such service robots recommend a product to customers in real environments, the effect on the customers is influenced not only by the robot itself, but also by the social influence of the surrounding people such as store clerks. Therefore, leveraging the social influence of the clerks may increase the influence of the robots on the customers. Hence, we compared the influence of robots with and without collaborative customer service between the robots and clerks in two bakery stores. The experimental results showed that collaborative customer service increased the purchase rate of the recommended bread and improved the impression regarding the robot and store experience of the customers. Because the results also showed that the workload required for the clerks to collaborate with the robot was not high, this study suggests that all stores with service robots may show high effectiveness in introducing collaborative customer service.
translated by 谷歌翻译